Audio Alchemy
INFO 523 - Final Project
Abstract
Music recommendation systems increasingly rely on machine learning to capture the complexity of user preferences, yet existing models struggle to account for language diversity and nuanced audio features in songs. This project applies signal processing, vocal separation (DEMUCS library), and machine learning techniques to classify song languages and integrate them with genre metadata for improved personalization. By combining automated data collection with advanced audio analysis, the system provides a foundation for smarter, more inclusive recommendation platforms that enhance user experience across diverse musical contexts. The project first applied Random Forests and Gaussian Mixture Models with 5-fold cross-validation for audio genre identification, then advanced to CNNs on spectrogram heat maps validated via a train/validation/test split with early stopping, evaluated through accuracy, precision, recall, F1-score, confusion matrices, and ROC curves.
Nathan #_to_Do
(Nathan) - sentence about results.
By applying statistical and time-frequency features to separated vocal and instrumental tracks, we evaluated the feasibility of machine learning models in song language recognition. Classical approaches such as Logistic Regression, Random Forests, and SVMs were trained with 5-fold cross-validation. Results showed that vocal-only features provided the strongest signals for classification. The analysis demonstrated that language prediction from raw audio is viable when leveraging targeted feature engineering.
(Yashi) - sentence about Part 2.
(Yashi) - sentence about results.
Introduction
Music genre classification is a central task in the field of music information retrieval, combining elements of signal processing, machine learning, and deep learning. Accurate genre identification not only enhances music recommendation systems and streaming platforms but also deepens our understanding of audio structure and human perception of sound. Traditional approaches have relied on handcrafted audio features analyzed with machine learning techniques such as Random Forests and Gaussian Mixture Models, offering interpretable yet limited performance[1]. Recent advances, however, leverage deep learning methods—particularly convolutional neural networks (CNNs)—to extract high-level representations directly from spectrograms, achieving state-of-the-art results[2]. This project explores both paradigms: first applying classical machine learning with 5-fold cross-validation, and then advancing to CNN-based classification on spectrogram heat maps, with results evaluated using standard metrics including accuracy, precision, recall, F1-score, confusion matrices, and ROC curves.
A note on spectrographic features of .mp3 vs. .wav
Nathan #_to_Do
Here I'll add a graph and discuss how, despite the large disparity in file size, the two file types are nearly identical in audio data, which makes comparing .mp3 and .wav files an unpromising research path.
Questions
1. Language Recognition with Separated Vocal & Audio Tracks
initial formulation
How can we leverage statistical and time-frequency features extracted from separated vocal and audio tracks to build effective language recognition models? Specifically, how can traditional machine learning methods — ranging from classical classifiers on simple statistical summaries to Gaussian Mixture Models on richer time-frequency features — be applied in this context?
- What are the key benefits and limitations of these approaches?
- How can careful feature engineering, feature integration, and thorough model evaluation improve the accuracy and robustness of language recognition systems?
- How do model results compare and contrast when using .wav files versus .mp3 files?
secondary formulation
From the initial formulation, we refined the question to specifically compare how different ablations of the audio track (complete song, vocal-only, and non-vocal) affect model performance.
How does model performance differ when predicting song language using features from complete songs, vocal-only tracks, and instrumental-only tracks?
What are the relative strengths and limitations of classical machine learning models (Logistic Regression, Random Forest, SVM) when applied to language recognition?
From the initial problem statement: what did you change?
2. Recommendation Systems Using Audio Features & User Data
initial formulation
How can user interaction data, combined with basic track metadata and simple audio features, be used to build an effective recommendation system using collaborative filtering and traditional machine learning methods?
- Furthermore, how can advanced audio features, dimensionality reduction, and clustering techniques improve personalized recommendations by better capturing user preferences and track characteristics from both vocal and non-vocal components?
- How do recommendation model results compare and contrast when using .wav files versus .mp3 files, considering the potential impact of audio quality and compression artifacts on feature extraction and recommendation performance?
secondary formulation
Nathan #_to_Do
From the initial problem statement: what did you change?
Dataset
data provenance
Nathan #_to_Do
software distribution
Nathan #_to_Do
data collection
Nathan #_to_Do
data storage
Nathan #_to_Do
graphs?
Nathan #_to_Do
.. any feature graphs I can think of…
Team member workload
Our project workload followed a structured week-by-week workflow, with responsibilities distributed among team members. We began by finalizing and sharing the proposal, followed by the individual collection and organization of ~200 audio files per person. Nathan Herling led the processing and validation of metadata, while each member focused on building machine learning pipelines and conducting iterative testing. The project concluded with a collaborative effort on final model evaluation, report preparation, and presentation development.
Just type 3-4 sentences about workload. Go off the proposal's 'week work map/individual duties' section, but put it into paragraph form.
Problem analysis and results
General
Nathan #_to_Do
Discuss what the original plan was, and what was done - in terms of easy, med problem design.
Q1 - Yashi
How can we leverage audio features from separated vocal and instrumental tracks to improve language recognition in music?
Data Collection: The dataset consisted of ~200 audio files, preprocessed into three ablations: complete songs, vocal-only tracks, and instrumental-only tracks. Features included time-domain statistics (mean, variance, skewness, kurtosis).
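As a rough illustration of the time-domain statistics listed above, the sketch below computes mean, variance, skewness, and kurtosis for a single waveform with NumPy. The function name `time_domain_stats`, the use of population moments, the excess-kurtosis convention, and the synthetic 440 Hz tone are all assumptions made for this example; the project's actual extraction scripts may differ.

```python
import numpy as np

def time_domain_stats(signal):
    """Return mean, variance, skewness, and excess kurtosis of a 1-D signal.

    Hypothetical helper for illustration; uses population (ddof=0) moments.
    """
    x = np.asarray(signal, dtype=np.float64)
    mean = x.mean()
    var = x.var()
    std = np.sqrt(var)
    if std == 0.0:
        # Skewness and kurtosis are undefined for a constant signal; report 0.0
        return {"mean": float(mean), "variance": 0.0,
                "skewness": 0.0, "kurtosis": 0.0}
    z = (x - mean) / std  # standardized samples
    return {
        "mean": float(mean),
        "variance": float(var),
        "skewness": float(np.mean(z ** 3)),
        "kurtosis": float(np.mean(z ** 4) - 3.0),  # excess kurtosis
    }

# Synthetic stand-in for a decoded mono track: one second of a 440 Hz tone
sr = 22050  # assumed sample rate
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
features = time_domain_stats(tone)
print(features)
```

In practice the same function would presumably be applied to each of the three ablations (complete song, vocal-only, instrumental-only) of every track, giving one feature row per track and ablation.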
Data Processing: All features were standardized using global scaling, and the target variable (language) was encoded with LabelEncoder. No major imputation was required because missingness was minimal.
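A minimal sketch of this preprocessing step, assuming scikit-learn's `StandardScaler` is what "global scaling" refers to (the report names only `LabelEncoder`). The feature values and language labels below are invented for illustration and are not taken from the project's dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical feature matrix: one row per track,
# columns = mean, variance, skewness, kurtosis
X = np.array([
    [0.01, 0.12, 0.30, -1.2],
    [-0.02, 0.20, -0.10, 0.4],
    [0.00, 0.08, 0.05, -0.6],
    [0.03, 0.15, 0.22, 1.1],
])
# Hypothetical language labels (illustrative only)
languages = ["english", "hindi", "spanish", "english"]

# Fit the scaler once on the full matrix (one reading of "global scaling")
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Map language names to integer class ids (classes are sorted alphabetically)
encoder = LabelEncoder()
y = encoder.fit_transform(languages)

print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
print(list(encoder.classes_), y)
```

One design consideration: fitting the scaler on the full dataset before cross-validation lets statistics from held-out folds influence training. Fitting it inside each training fold is a common alternative when that matters.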
Model Selection: I evaluated three models: Logistic Regression, Random Forest, and Support Vector Machines with linear kernels. These models were chosen for their balance of interpretability, robustness, and suitability for structured feature data. Training and evaluation were conducted using 5-fold stratified cross-validation to ensure reliable performance comparisons across models.
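The comparison described above could be set up roughly as follows with scikit-learn. This is a sketch under stated assumptions rather than the project's actual pipeline: the data are synthetic (generated with `make_classification` as a stand-in for about 200 tracks), the number of languages (3) and all hyperparameters are placeholders, and macro averaging is assumed for precision, recall, and F1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in: 200 "tracks", 4 features, 3 "languages" (all assumed)
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# The three model families named in the report; settings are placeholders
models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM_linear": SVC(kernel="linear"),
}

# 5-fold stratified CV keeps class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the five held-out folds
    results[name] = {m: float(np.mean(scores[f"test_{m}"])) for m in scoring}

for name, row in results.items():
    print(name, {m: round(v, 3) for m, v in row.items()})
```

Running the same loop once per ablation (complete song, vocal-only, no-vocal) would produce one row per ablation and model, matching the layout of the evaluation table.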
Validation & Metrics: Evaluation focused on accuracy, precision, recall, and F1-score. Confusion matrices were used to analyze per-class misclassification patterns.
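To make the link between a confusion matrix and these metrics concrete, the sketch below derives accuracy and macro-averaged precision, recall, and F1 from a small matrix using NumPy. The helper name `macro_metrics` and the counts are hypothetical, and macro averaging is an assumption about how the reported scores were aggregated.

```python
import numpy as np

def macro_metrics(conf):
    """Accuracy and macro-averaged precision, recall, F1 from a confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf).copy()      # correct predictions per class
    predicted = conf.sum(axis=0)   # column sums: times each class was predicted
    actual = conf.sum(axis=1)      # row sums: true count of each class
    # Guard against division by zero for classes never predicted or absent
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": float(tp.sum() / conf.sum()),
        "precision": float(precision.mean()),
        "recall": float(recall.mean()),
        "f1": float(f1.mean()),
    }

# Hypothetical 3-language confusion matrix (illustrative counts only)
conf = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 8],
]
metrics = macro_metrics(conf)
print(metrics)
```

Because macro averaging weights every language equally, a model can reach a relatively high accuracy while its macro F1 stays low if it favors the more frequent classes.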
Model Evaluation:
| Ablation | Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| complete_song | LogReg | 0.399 | 0.398 | 0.447 | 0.387 |
| complete_song | RandomForest | 0.626 | 0.469 | 0.410 | 0.401 |
| complete_song | SVM_linear | 0.432 | 0.443 | 0.504 | 0.427 |
| vocal_only | LogReg | 0.560 | 0.531 | 0.567 | 0.509 |
| vocal_only | RandomForest | 0.552 | 0.426 | 0.404 | 0.385 |
| vocal_only | SVM_linear | 0.544 | 0.542 | 0.583 | 0.514 |
| no_vocal | LogReg | 0.333 | 0.371 | 0.364 | 0.316 |
| no_vocal | RandomForest | 0.577 | 0.436 | 0.349 | 0.328 |
| no_vocal | SVM_linear | 0.366 | 0.418 | 0.411 | 0.347 |
| Column Min | - | 0.333 | 0.371 | 0.349 | 0.316 |
| Column Max | - | 0.626 | 0.542 | 0.583 | 0.514 |
Results:
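Vocal-only tracks: Provided the best classification signal by macro F1, with the linear SVM reaching 0.514, narrowly ahead of Logistic Regression (0.509) and well ahead of Random Forest (0.385).
Complete songs: Models achieved moderate performance (F1 of 0.39 to 0.43), reflecting a mixture of useful vocal cues diluted by instrumental content. Random Forest reached the highest accuracy in the table here (0.626), but its F1 was only 0.401.
Non-vocal tracks: Performance was the weakest of the three ablations (F1 of 0.32 to 0.35, with accuracy between 0.33 and 0.58 depending on the model), consistent with the expectation that language recognition requires vocal content.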
- restate question
- data collection - data set size/composition (var types) Data was collected through a series of Python scripts. [see slide 3] You had an extra layer of feature extraction - removing vocal/instrumental tracks.
- data processing - any PCA, correlation, imputation, outlier removal?
- model selection - what models, why? (if any reason), what Python libraries did you use?
- model validation - what metrics? why?
- model evaluation - what metrics? why? Note: the ‘no vocal’ track - behaved at about 50% accuracy - which is what you’d expect for a control group.
- future steps/recommendations
Q2
Nathan #_to_Do
- put any graphs in this section.
- do you have learning curve, ROC curves? Feature importance graphs?
- restate question
- data collection - data set size/composition (var types) Data was collected through a series of Python scripts. [see slide 3] You had an extra layer of feature extraction - removing vocal/instrumental tracks.
- data processing - any PCA, correlation, imputation, outlier removal?
- model selection - what models, why? (if any reason), what Python libraries did you use?
- model validation - what metrics? why?
- model evaluation - what metrics? why?
- future steps/recommendations
Results & Conclusion
Yashi
The project goal was to investigate how audio feature engineering can support language classification in music. Our analysis demonstrated that vocal-only features drive most of the predictive signal, confirming that separated vocals provide the strongest basis for language recognition. These results demonstrate the potential of classical machine learning for this task but also highlight its limits: larger datasets and deep learning methods will be needed for stronger multilingual classification in the future.
Restate the project goal, and the goal of your question. What was done in the analysis, and what was found with the features extracted. (1 paragraph)
Nathan #_to_Do
Restate the project goal, and the goal of your question. What was done in the analysis, and what was found with the features extracted. (1 paragraph)
Video links
Nathan #_to_Do
Audio Player Demo
Nathan #_to_Do
sources
Nathan #_to_Do
[1] https://link.springer.com/chapter/10.1007/978-981-97-4533-3_6
[2] https://arxiv.org/html/2411.14474v1